Migrate user_metadata_test.go to testcore.NewEnv #9672
Draft
chaptersix wants to merge 38 commits into temporalio:main from
Conversation
…stcore.NewEnv Replace suite-based test pattern with t.Run subtests using testcore.NewEnv(t) for cluster lifecycle management. Use s.Tv() for test variables and s.Context() for timeout contexts.
## What changed? - Clean up unused WV3 test code paths - Enable the use-revision-numbers config by default ## How did you test it? - [ ] built - [ ] run locally and tested manually - [ ] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) ## Potential risks None
## What changed? 1. tests/task_queue_stats_test.go — Remove the `t *testing.T` field from `taskQueueStatsSuite` and replace all `s.t` usages with `s.T()`. Critically, fix `validateDescribeWorkerDeploymentVersion` to use `a.NoError(err)` instead of `require.NoError(s.T(), err)` inside the callback. 2. tests/xdc/history_replication_signals_and_updates_test.go — Fix `pollWorkflowResult` to handle errors from `GetWorkflowExecutionHistory` gracefully: return the error from the inner function, retry with a nil page token on transient errors (e.g. CurrentBranchChanged after conflict resolution), and abort cleanly when the context expires. ## Why? TestTaskQueueStats panic: Calling require.NoError(s.T(), err) inside an EventuallyWithT callback is unsafe. EventuallyWithT runs its callback in a separate goroutine on a ticker. If the outer test context expires and the test completes while this callback is still running, the require.NoError call invokes t.Fail() on a completed test, which panics the entire test binary with panic: Fail in goroutine after TestXxx has completed. The fix uses the CollectT-scoped a.NoError(err) so assertions are buffered and safe. TestConflictResolutionGetResult hang/crash: pollWorkflowResult was calling responseInner.History.Events without checking if responseInner was nil. When GetWorkflowExecutionHistory returns an error (e.g. CurrentBranchChanged after conflict resolution resets the branch token), the response is nil, causing a nil pointer dereference that panics the goroutine. Because the goroutine crashed before sending to workflowResultCh, the main test goroutine blocked on <-workflowResultCh for the entire 30-minute CI timeout. The fix returns the error from the inner function and retries with a nil page token on transient errors, which allows the poll to succeed on the updated branch. ## How did you test it? 
- [ ] built - [X] run locally and tested manually - [X] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) ## Potential risks The XDC fix changes `pollWorkflowResult` to retry indefinitely on any non-context error. If a non-transient error were to occur repeatedly, this could loop until the context expires rather than failing immediately. This is acceptable since the context timeout provides a hard bound, and the test failure message via `c.t.s.NoError(ctx.Err(), ...)` will make the root cause clear.
…poralio#9675) ## Summary - Replace suite-based `AdminTestSuite` with `t.Run` subtests using `testcore.NewEnv()` for better parallelization and simpler test infrastructure - Convert `rebuildMutableStateWorkflowHelper` to accept `*testcore.TestEnv` instead of `*AdminTestSuite` - Pass `EnableChasm` via `testcore.WithDynamicConfig` instead of `OverrideDynamicConfig` ## Test plan - [x] `go build` compiles cleanly - [x] `make lint` passes - [x] `go test -tags disable_grpc_modules,test_dep -run TestAdminTestSuite ./tests/...` passes
…9676) ## Summary - Migrate `activity_api_batch_reset_test.go` from suite-based pattern (`FunctionalTestBase` + `suite.Run`) to `testcore.NewEnv` with `t.Run` subtests - Convert `createWorkflow` receiver method to standalone `createBatchResetWorkflow` function - Replace `s.TaskQueue()` with `s.WorkerTaskQueue()` ## Test plan - [x] `go build -tags disable_grpc_modules,test_dep ./tests/...` - [x] `go test -tags disable_grpc_modules,test_dep -timeout 5m -run TestActivityApiBatchResetClientTestSuite ./tests/...` - [x] `make lint`
…emporalio#9685) ## Summary - Route `ErrClosed` from CHASM scheduler through to V1 fallback, so operations on deleted CHASM schedules fall through to the workflow-backed scheduler stack - Add explicit `ErrClosed` checks in `Scheduler.Update` and `Scheduler.Patch` since `UpdateComponent` does not reject mutations on completed executions - Add sentinel workflow detection in `describeScheduleWorkflow` to return NotFound instead of QueryFailed when the V1 fallback hits a sentinel dummy workflow - Add functional tests for operations on deleted schedules (describe, update, patch, delete) across both CHASM and V1 paths - Add migration test verifying V1-to-V2 migration succeeds when a CHASM schedule with the same ID was previously deleted (closed executions allow business ID reuse) ## Test plan - [x] `TestScheduleUpdateAfterDelete` -- update, patch, delete on closed CHASM schedule via scheduler client - [x] `testDeletedScheduleOperations` in shared schedule suite -- describe returns NotFound for both CHASM and V1 - [x] `TestScheduleMigrationV1ToV2WithClosedV2` -- V1-to-V2 migration with previously deleted V2 schedule - [x] Existing schedule migration tests still pass
…ralio#9382) ## What changed? - Ensure the TQs receive and apply the right version data after revive. - Made delete propagation always happen serially to other propagations. It ensures all other propagations are cancelled before starting delete propagation. - Deprecate the `deleted` flag in version data and the GC logic around it. Now we use the good old forgetVersion path, which immediately removes the version data from the TQ. - Ensure version state is reset after revive, in case the recreation happened before workflow close. - Also, workflows now CaN based on the SDK's suggestion if no pending Signal or Update is present. ## Why? The version could get stuck in the deleted state from the TQ's POV if revived before the (now deprecated) GC logic cleaned it up. ## How did you test it? - [ ] built - [ ] run locally and tested manually - [ ] covered by existing tests - [x] added new unit test(s) - [x] added new functional test(s) ## Potential risks None --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…alio#9696) ## What changed? Added support for `IS NULL` and `IS NOT NULL` operators in the ListWorkers query engine. - Implemented `evaluateIsExpr` in `worker_query_engine.go` to handle `IS NULL` / `IS NOT NULL` for string and time fields - For string fields: `IS NULL` matches empty values, `IS NOT NULL` matches non-empty values - For time fields (`StartTime`, `HeartbeatTime`): `IS NULL` matches when the timestamp is not set - Unsupported IS operators (`IS TRUE`, etc.) return a clear error - Updated doc comment with the new supported operators ## Why? To enable filtering workers by whether a field is set or not (e.g., `DeploymentName IS NULL` to find workers without a deployment). ## How did you test it? - [x] built - [x] added new unit test(s) New tests: - String field IS NULL / IS NOT NULL (DeploymentName with empty and populated values) - Time field IS NULL / IS NOT NULL (StartTime, HeartbeatTime with nil and set timestamps) - Composite queries combining IS NULL with AND/OR - Error cases: unknown column rejected, unsupported IS operators (e.g., IS TRUE) rejected
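The string and time semantics described above can be sketched as plain predicates (illustrative only; the real `evaluateIsExpr` walks the parsed query AST rather than taking raw values):

```go
package main

import (
	"fmt"
	"time"
)

// evaluateIsString: IS NULL matches empty string values,
// IS NOT NULL matches non-empty ones.
func evaluateIsString(value string, isNull bool) bool {
	if isNull {
		return value == ""
	}
	return value != ""
}

// evaluateIsTime: IS NULL matches when the timestamp is not set (nil),
// as for StartTime and HeartbeatTime.
func evaluateIsTime(value *time.Time, isNull bool) bool {
	if isNull {
		return value == nil
	}
	return value != nil
}

func main() {
	fmt.Println(evaluateIsString("", true)) // empty DeploymentName matches IS NULL
	now := time.Now()
	fmt.Println(evaluateIsTime(&now, false)) // set StartTime matches IS NOT NULL
}
```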
…o#9705) ## Summary - Adds two new metric labels (`temporal_worker_deployment_name` and `temporal_worker_deployment_build_id`) to `approximate_backlog_count` and related backlog metrics - These decompose the existing `worker_version` label (which encodes `deploymentName:buildId` as a single string) into two separate fields - Enables downstream consumers like Kubernetes HPA via prometheus-adapter to match on deployment name and build ID individually, without hitting Kubernetes label length/format constraints ## Cardinality impact **None.** The new labels are a 1:1 decomposition of the existing `worker_version` value — every unique `(deployment_name, build_id)` tuple maps to exactly one `worker_version` string. No new time series are created. The same `BreakdownMetricsByBuildID` dynamic config gates all three labels:

| Queue type | `worker_version` | `temporal_worker_deployment_name` | `temporal_worker_deployment_build_id` |
|---|---|---|---|
| Unversioned | `__unversioned__` | "" | "" |
| V3 versioned, gate **off** | `__versioned__` | "" | "" |
| V3 versioned, gate **on** | `myDeploy:build1` | `myDeploy` | `build1` |

## Files changed - `common/metrics/tags.go` — New tag constants and constructor functions - `service/matching/physical_task_queue_manager.go` — Tags added to handler (propagates to db.go emission sites) - `service/matching/task_queue_partition_manager.go` — Tags added to logical backlog emission sites + `parseDeploymentFromVersionKey` helper ## Test plan - [ ] Existing matching service tests pass (CI) - [ ] Verify labels appear in metrics output after deployment 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- CURSOR_SUMMARY --> --- > [!NOTE] > **Low Risk** > Low risk: changes are limited to metrics tagging and are gated by the existing `BreakdownMetricsByBuildID` flag, but could affect downstream metric queries/dashboards expecting the prior label set.
> > **Overview** > Adds two new metrics tags, `temporal_worker_deployment_name` and `temporal_worker_deployment_build_id`, alongside `worker_version`. > > Propagates these tags through matching task queue manager/backlog metric emission (including logical backlog and unload zeroing), extracting deployment/build ID from the version key when applicable and updating the affected matching metrics test expectations. <!-- /CURSOR_SUMMARY --> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Carly de Frondeville <cdefrondeville@berkeley.edu>
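The decomposition of the version key can be sketched as a small helper (hypothetical `splitVersionKey`; the PR's actual `parseDeploymentFromVersionKey` may differ in detail, e.g. in how sentinel values are detected):

```go
package main

import (
	"fmt"
	"strings"
)

// splitVersionKey decomposes a worker_version value like "myDeploy:build1"
// into its deployment name and build ID. Sentinel values such as
// "__unversioned__" and "__versioned__" carry no deployment info, so both
// labels stay empty, matching the gating table above.
func splitVersionKey(key string) (deployment, buildID string) {
	if key == "" || strings.HasPrefix(key, "__") {
		return "", ""
	}
	if i := strings.Index(key, ":"); i >= 0 {
		return key[:i], key[i+1:]
	}
	return "", ""
}

func main() {
	d, b := splitVersionKey("myDeploy:build1")
	fmt.Println(d, b)
}
```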
## What changed? Wait for tasks to actually load in fairness migration test. ## Why? The waitForTasks tests only waited for metadata initialization, not for tasks to actually load. We have to wait for them to load too, to prevent a poller from grabbing an active task before draining. ## How did you test it? - [ ] built - [ ] run locally and tested manually - [x] covered by existing tests — ran 200 times in CI - [ ] added new unit test(s) - [ ] added new functional test(s)
…lio#9723) ## What changed? Change `temporal_worker_deployment_name` -> `worker_deployment_name` and `temporal_worker_deployment_build_id` -> `worker_build_id`. ## Why? Found out that External Obs can add the prefix later, as it does for other tags. Also dropped `_deployment_` from the build id tag because the build id has a stronger relation to the worker than the worker deployment, and this is easier to say and type. ## How did you test it? - [ ] built - [ ] run locally and tested manually - [ ] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) ## Potential risks Zero risk. We added these labels last night. If we can patch this into OSS (or just merge to main and restart Long Haul) no one will ever see the tags we merged last night. <!-- CURSOR_SUMMARY --> --- > [!NOTE] > **Low Risk** > Low risk change limited to renaming two metrics tag keys; the main impact is dashboards/alerts or external tooling expecting the old tag names. > > **Overview** > Renames the metrics tag keys for worker deployment identifiers in `common/metrics/tags.go`, changing `temporal_worker_deployment_name` to `worker_deployment_name` and `temporal_worker_deployment_build_id` to `worker_build_id` for shorter labels. <!-- /CURSOR_SUMMARY -->
## What changed? Use atomic.Value with Swap to please the race detector for the race between the watchMembership goroutine (writer) and gRPC goroutines (readers). ## Why? Data race in integration test ## How did you test it? - [x] built - [x] run locally and tested manually - [x] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) ## Potential risks (from @paulnpdev's original commit) Hard to evaluate whether the code utilizing this field correctly handles the real-world race involving mutation of this field in real time, especially as I am not very familiar with this code. Counting on my reviewers to pay close attention here. Here is Claude's analysis: Writer (notifyNexusEndpointsOwnershipChange, single goroutine via watchMembership): Loads the current channel (line 2769) Closes it Stores a new channel (line 2770) Reader A (checkNexusEndpointsOwnership, any gRPC goroutine): Loads the channel atomically (line 2751) Returns it to the caller Reader B (ListNexusEndpoints, uses the returned channel): Calls checkNexusEndpointsOwnership, gets ownershipLostCh (line 2703) Selects on ownershipLostCh in long-poll loop (line 2735) Correctness evaluation: The critical sequence in the writer (lines 2769-2770) is a load-close-store — two separate atomic operations, not one. There's a window between close and store where a concurrent reader could load the already-closed channel. This is actually fine for the intended semantics: If a reader loads the old channel (before or during close): it will see the close signal and correctly report ownership lost. If a reader loads the new channel (after store): it will block until the next ownership change — also correct. If a reader loads the old channel between close and store: it gets a closed channel, which immediately unblocks the select — correctly signaling ownership lost. One real concern: line 2769 does a second Load() to get the channel to close. 
If two membership changes fired in rapid succession (the channel is buffered with size 1), could notifyNexusEndpointsOwnershipChange race with itself? No — the comment on line 2761 states this method is only called from the single watchMembership goroutine (the for range loop on line 359). So the writer is single-threaded, and the load-close-store sequence is safe from self-races. Verdict: The atomic fix is correct. The close-then-store non-atomicity is benign because any interleaving produces the correct outcome (the closed channel signals ownership loss regardless of when a reader observes it). --------- Co-authored-by: Roey Berman <roey.berman@gmail.com>
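The pattern above can be reduced to a few lines (a stand-in for illustration, not the server's actual code): `Swap` publishes the new channel atomically, and closing the old one signals every reader that loaded it:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// ownership holds a channel that is closed when ownership is lost.
// Readers Load the channel from any gRPC goroutine and select on it;
// the single writer (the membership watcher) swaps in a fresh channel
// and closes the old one.
type ownership struct {
	ch atomic.Value // holds chan struct{}
}

func newOwnership() *ownership {
	o := &ownership{}
	o.ch.Store(make(chan struct{}))
	return o
}

// lostCh is safe to call concurrently from readers.
func (o *ownership) lostCh() chan struct{} {
	return o.ch.Load().(chan struct{})
}

// notifyChange must be called from a single goroutine. Any interleaving
// is benign: a reader that loaded the old channel observes the close,
// a reader that loads the new one blocks until the next change.
func (o *ownership) notifyChange() {
	old := o.ch.Swap(make(chan struct{})).(chan struct{})
	close(old)
}

func main() {
	o := newOwnership()
	old := o.lostCh()
	o.notifyChange()
	select {
	case <-old:
		fmt.Println("old channel closed: ownership loss observed")
	default:
		fmt.Println("unexpected: old channel still open")
	}
}
```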
…emporalio#9708) Change the default for FrontendNexusRequestHeadersBlacklist from empty to block internal headers (accept-encoding, x-forwarded-for, xdc-redirection, xdc-redirection-api, temporal-nexus-failure-support) that shouldn't propagate to user Nexus handlers.
…oralio#9692) ## What changed? testrunner writes intermediate JUnit XML reports after each completed test attempt and enforces its own total deadline derived from the go test -timeout flag. ## Why? When a test run exceeds the GitHub Actions job timeout (40 min), the entire process group is killed before the testrunner can write its final merged XML, producing a synthetic junit.crash.xml showing `functional-test (crash)`, with no visibility into actual test failures. [Sample run](https://github.com/temporalio/temporal/actions/runs/23567876357) that actually had 108 failures on manual inspection There are two distinct failure modes this addresses: 1. Killed between attempts — attempt 1 completes with failures, the process is killed during attempt 2. Previously: no XML on disk. Now: attempt 1's results survive in the intermediate file. 2. Go test timeout doesn't fire — the test binary hangs past its own `-timeout=35m` (e.g. deadlocked goroutines). Previously: GitHub kills at 40 min with no XML. Now: the testrunner enforces the same 35m deadline itself, kills `gotestsum`, extracts any `--- FAIL:` lines captured from stdout, writes a report, and exits cleanly within the 5-minute buffer before GitHub's kill. ## How did you test it? - [X] built - [ ] run locally and tested manually - [ ] covered by existing tests - [X] added new unit test(s) - [ ] added new functional test(s) ## Potential risks The testrunner's context deadline fires at the same time as go test's own -timeout. If both fire simultaneously, parseTestTimeouts (existing path) and the new `ctx.Err()` check (new path) can both trigger. The `ctx.Err()` check is guarded to only activate if the normal timeout-panic path didn't already break the loop, so there is no double-processing.
…lio#9377) ## What changed? Add an archetype metric tag and logical task processing metrics, and add `chasm_pure_task_requests` and `chasm_pure_task_errors` metrics. The archetype metric tag will be applied to all tasks executed, physical and logical, that get added in Executable.go by default. If the archetype is found, the archetype display name will be used for the tag value. Otherwise, it will default to the `workflow` archetype. For chasm logical pure task metrics, the tag will resolve to `__unknown__` if the archetype is not found, since this would be unexpected behavior. ## Why? Components, and therefore task types, can be used across different archetype executions. In the future, with more archetypes, there needs to be visibility into which archetype created the task for execution. ## How did you test it? - [X] built - [X] run locally and tested manually - [X] covered by existing tests - [X] added new unit test(s) - [ ] added new functional test(s)
## What changed? Moved the decision of which d2 layout engine to use into the file itself. ## Why? Allows each file to pick the right layout engine for itself. Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…poralio#9743) ## What changed? * Upgrade `actions/create-github-app-token` and `actions/setup-go` because of deprecation notices for NodeJS * Change the test summary to `!cancelled` so successful runs will still show the failed tests in Summary. ## Why? Deprecation warnings for the first bullet point. For the second bullet point, it's helpful to see the flaky tests on a successful run. ## How did you test it? - [ ] built - [ ] run locally and tested manually - [ ] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) ## Potential risks Tests only
## What changed? This PR implements the server-side plumbing for Principal Attribution - attaching a server-computed, immutable caller identity (Principal) to workflow history events. * Extended the Authorizer Result struct to include a Principal struct * Added the authorization interceptor logic to propagate Principal via gRPC metadata headers when enabled via dynamic config (frontend.enablePrincipalPropagation) * The default authorizer sets principal based on sub JWT claim * On the history service side, Principal is read from context headers and stamped onto every history event * Principal is set per-operation at two entry points: UpdateWorkflowWithNew (existing workflows) and NewWorkflowWithSignal (new workflows) * Added Principal headers to the Nexus disallowed operation headers blocklist to prevent leaking internal identity to external endpoints ## Why? The existing identity field on history events is client-supplied and spoofable. Principal Attribution provides a trusted, server-computed identity derived from authentication context (e.g. JWT claims, mTLS certificates), enabling reliable audit trails for workflow operations. ## How did you test it? - [X] built - [X] run locally and tested manually - [X] covered by existing tests - [X] added new unit test(s) I ran a simple test with local JWT server (PR here: temporalio#9647). There a test client sends a signal to a running workflow using a separate JWT identity, then waits for the workflow to complete. 
Here is the workflow history:

| eventId | eventType | principal |
|---------|-----------|-----------|
| 1 | WORKFLOW_EXECUTION_STARTED | jwt/workflow-starter |
| 2 | WORKFLOW_TASK_SCHEDULED | jwt/workflow-starter |
| 3 | WORKFLOW_TASK_STARTED | jwt/workflow-starter |
| 4 | WORKFLOW_TASK_COMPLETED | jwt/workflow-starter |
| 5 | WORKFLOW_EXECUTION_SIGNALED | jwt/signal-sender |
| 6 | WORKFLOW_TASK_SCHEDULED | jwt/signal-sender |
| 7 | WORKFLOW_TASK_STARTED | jwt/workflow-starter |
| 8 | WORKFLOW_TASK_COMPLETED | jwt/workflow-starter |
| 9 | ACTIVITY_TASK_SCHEDULED | jwt/workflow-starter |
| 10 | ACTIVITY_TASK_STARTED | jwt/workflow-starter |
| 11 | ACTIVITY_TASK_COMPLETED | jwt/workflow-starter |
| 12 | WORKFLOW_TASK_SCHEDULED | jwt/workflow-starter |
| 13 | WORKFLOW_TASK_STARTED | jwt/workflow-starter |
| 14 | WORKFLOW_TASK_COMPLETED | jwt/workflow-starter |
| 15 | WORKFLOW_EXECUTION_COMPLETED | jwt/workflow-starter |

## Potential risks - Feature-gated: Principal propagation is behind frontend.enablePrincipalPropagation dynamic config (default off), so no impact unless explicitly enabled. - Nexus blocklist addition: Adding principal-type/principal-name to the Nexus disallowed headers blocklist could reject operations that happen to use these header names
…alio#9493) ## What changed? In some cases for the gcloud connector the writer could be left open if the io.Copy call failed. This condition occurs when the gcloud client returns an error due to 429s or other error responses. In this case the writer is never closed, when the intention of the code is just to ensure the most relevant error is surfaced. This change ensures that in this case the correct error is surfaced and an attempt is made to close the writer. ## Why? A flurry of 503s and 429s from google made me look at this code, and I became aware this may be a memory leak ## How did you test it? - [ ] built - [ ] run locally and tested manually - [x] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) ## Potential risks This is a minor change, very low risk on an error path --------- Co-authored-by: Vladyslav Simonenko <vlad.simonenko@temporal.io>
## What changed? Exclude polling cases from Get APIs. ## Why? There are long-polling cases in the Get APIs, and we have trouble identifying those cases among regular Get APIs. ## How did you test it? - [ ] built - [ ] run locally and tested manually - [ ] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s)
## What changed? Added `parallelsuite` package to provide an alternative to testify's suite. **For reviewer: Use "Hide Whitespace" to make the diff smaller.** ## Why? 1. **Parallel by default**: all test methods and subtests run in parallel automatically 2. **IDE support**: GoLand run gutters, navigation, and test discovery work on Test* methods 3. **Misuse detection**: guard catches mixing assertions with Run(), both directions, at runtime 4. **Type-safe subtests**: `s.Run` callbacks receive the concrete suite type, so suite helpers are available 5. **No shared state**: each test method and subtest gets its own fresh copy ## How did you test it? - [ ] built - [ ] run locally and tested manually - [x] covered by existing tests - [x] added new unit test(s) - [ ] added new functional test(s)
) ## What changed? Add new review guidelines for test reliability and fix instances found in the current codebase. ## Why? I have found these to be the most common patterns across the last two weeks of targeting flaky tests. ## How did you test it? - [ ] built - [ ] run locally and tested manually - [ ] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) ## Potential risks Minimal
## What changed? Adds `buf format` as a Makefile target; and integrates it into `make fmt`. All `.proto` changes are from running `make fmt`. ## Why? Consistent protobuf file style. ## How did you test it? - [x] built - [ ] run locally and tested manually - [ ] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s)
## What changed? Deprecate `WithSdkWorker` testEnv option and init it lazily instead. ## Why? Requires less manual setup. ## How did you test it? - [ ] built - [ ] run locally and tested manually - [x] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) Co-authored-by: Alex Stanfield <13949480+chaptersix@users.noreply.github.com>
## What changed? As titled. ## Why? Migrating to the latest and greatest, allowing parallel test runs. ## How did you test it? - [ ] built - [ ] run locally and tested manually - [x] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) ``` longtran@Longs-MacBook-Pro:~/work/temporal (branch: migrate-to-testenv!) [1] $ go test ./tests/ -run TestLinksTestSuite ✘ ok go.temporal.io/server/tests 1.148s ```
Automatically generated by the optimize-test-sharding workflow. Co-authored-by: Temporal Data <commander-data@temporal.io>
## What changed? WISOTT ## Why? Safety and runtime increases from parallelizing tests. ## How did you test it? - [ ] built - [ ] run locally and tested manually - [ ] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) ## Potential risks NA
## What changed? Changing `s`s to `env`s, i.e. migrates some of the tests that were already using testEnv but not yet parallelsuite. **Tip: Use "Hide Whitespace" for review** ## Why? Progress is inevitable. ## How did you test it? - [ ] built - [ ] run locally and tested manually - [x] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s)
## What changed and why? ### **Security** - Check namespaces in batch workflow ([CVE-2026-5199](https://www.cve.org/cverecord?id=CVE-2026-5199), LOW) ## How did you test it? - [x] built - [ ] run locally and tested manually - [ ] covered by existing tests - [x] added new unit test(s) - [x] added new functional test(s) <!-- CURSOR_SUMMARY --> --- > [!NOTE] > **Medium Risk** > Tightens namespace validation for privileged batcher activities to prevent cross-namespace request spoofing; mistakes here could block legitimate batch jobs or still leave gaps if other call paths use unvalidated namespaces. > > **Overview** > Adds strict namespace isolation checks to the batcher worker: `BatchActivityWithProtobuf` now validates that `NamespaceId` *and* any provided `Request.Namespace`/`AdminRequest.Namespace` match the worker’s bound namespace, and consistently uses that bound namespace for downstream frontend calls (including reset-by-type). > > Adds targeted tests: new unit tests to reject mismatched namespace strings and to assert `startTaskProcessor` uses the worker namespace for signals, plus a functional test ensuring a batch terminate in one namespace does not affect workflows in another. <!-- /CURSOR_SUMMARY --> --------- Co-authored-by: Sean Kane <sean.kane@temporal.io>
temporalio#9091) ## What changed? Remove nil search attributes from history events / mutable state / chasm node ## Why? Search attributes can be explicitly set as null on the following events: * Workflow Start * Upsert Search Attributes * Continue-as-new Starting a workflow or continuing a workflow as new with a nil search attribute records the `{key: nil}` mapping in mutable state and the history event. Upserting search attributes erases any current search attribute value from persistence (mutable state and visibility record). Upserting the search attribute as nil in the SDK retains the mapping in the WorkflowInfo as a {key: nil} mapping. On a continue-as-new event, SDK sends this mapping to Server, and the search attribute gets repopulated with the null search attribute in the history event and mutable state. This last scenario can cause inconsistency between the expected search attributes and what is recorded in history. After upserting a nil search attribute, the mutable state and visibility record no longer shows this search attribute. But, once continuing as new, the search attribute reappears, leading to unexpected NDE if the search attribute is also deregistered. ## CHASM change This also removes nil search attributes from CHASM nodes on replacement. This mirrors the change to Workflow History Events and Mutable State. ## How did you test it? - [X] built - [X] run locally and tested manually - [X] covered by existing tests - [X] added new unit test(s) - [X] added new functional test(s)
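The pruning itself is simple; a sketch over a plain byte-slice map (the real code operates on `*commonpb.Payload` maps in mutable state, history events, and CHASM nodes):

```go
package main

import "fmt"

// pruneNilSearchAttributes drops keys whose payload is nil before they
// are recorded, so a {key: nil} mapping from Start, Upsert, or
// Continue-as-New never reaches mutable state or the history event.
func pruneNilSearchAttributes(fields map[string][]byte) map[string][]byte {
	out := make(map[string][]byte, len(fields))
	for k, v := range fields {
		if v != nil {
			out[k] = v
		}
	}
	return out
}

func main() {
	in := map[string][]byte{
		"CustomKeywordField": []byte(`"v"`),
		"UpsertedAsNil":      nil, // would otherwise reappear on CaN
	}
	fmt.Println(len(pruneNilSearchAttributes(in)))
}
```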
## What changed? Replace plain `int` with `atomic.Int32` in two VersionWorkflowSuite unit tests. ## Why? CI caught this as a crash in the unit test job Failure [here](https://github.com/temporalio/temporal/actions/runs/23856744785) ``` WARNING: DATA RACE Read at 0x00c000b005e8 by goroutine 320: go.temporal.io/server/service/worker/workerdeployment.(*VersionWorkflowSuite).Test_MultipleSyncStates_BlocksCaNUntilAllComplete.func2() version_workflow_test.go:1240 +0x2c Previous write at 0x00c000b005e8 by goroutine 326: ...version_workflow_test.go:1240 +0x3c ``` ## How did you test it? - [ ] built - [ ] run locally and tested manually - [X] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) ## Potential risks NA
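The fix reduces to the standard atomic-counter pattern (a stand-in reproducing just the counter, not the suite's actual tests):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// parallelCount increments a shared counter from n goroutines. The
// failing tests used a plain int (`completed++`) here, a racy
// read-modify-write that the race detector flags when the main test
// goroutine reads it; atomic.Int32 makes both sides race-free.
func parallelCount(n int) int32 {
	var completed atomic.Int32 // was: completed := 0
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			completed.Add(1) // was: completed++
		}()
	}
	wg.Wait()
	return completed.Load()
}

func main() {
	fmt.Println(parallelCount(10))
}
```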
## Why? Security: Check namespaces in scheduler workflow ([CVE-2026-5199](https://www.cve.org/cverecord?id=CVE-2026-5199))
## What changed? 1. Updated Go SDK to latest version per https://pkg.go.dev/go.temporal.io/sdk?tab=versions, latest I see today is `v1.41.1` from Mar 17 2026 2. Ran `go mod tidy` 3. Some tests were failing due to the SDK interface changing; ran `cd common/testing/mocksdk && go generate` -- generated changes in `common/testing/mocksdk/client_mock.go` 4. Fixed Nexus test suite due to breaking changes and deprecated APIs in the new SDK -- changes in `tests/nexus_workflow_test.go` ## Why? To use the latest version of our SDK and keep tests up-to-date. ## How did you test it? - [x] built - [x] run locally and tested manually - [x] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s)
Only replicating as part of namespace creation, not on namespace update. ## What changed? Explicitly include ReplicationState in the replication of namespace creation. ## Why? Currently ReplicationState is not replicated on namespace creation, leading to cases where the Active cluster is state:Normal and the Standby cluster is state:Undefined. ## How did you test it? - [x] built - [x] run locally and tested manually - [x] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) ## Potential risks New section of code to me, hard to assess, but this contains minimal changes. Testing: 1. make start-dependencies 2. make install-schema-xdc 3. make start-xdc-cluster-a 4. make start-xdc-cluster-b 5. temporal --address 127.0.0.1:8233 operator cluster upsert --frontend-address "127.0.0.1:8233" --enable-connection --enable-replication 6. temporal --address 127.0.0.1:8233 operator cluster upsert --frontend-address "127.0.0.1:7233" --enable-connection --enable-replication 7. temporal --address 127.0.0.1:8233 operator namespace list 8. temporal --address 127.0.0.1:7233 operator namespace create --global --active-cluster cluster-a --cluster cluster-a --cluster cluster-b -n demo_namespace 9. temporal --address 127.0.0.1:8233 operator namespace list Output for 9: Branch: % temporal --address 127.0.0.1:8233 operator namespace list ReplicationConfig.State Normal Main: % temporal --address 127.0.0.1:8233 operator namespace list ReplicationConfig.State Unspecified --------- Co-authored-by: stuart-wells <stuart.well@temporal.io>
## What changed? Instead of dropping expired tasks immediately, pass them through mergeTasksLocked and treat them as already-acked, so that the read level and ack level can be updated appropriately. ## Why? As the previous comment suggests, reading a batch of entirely expired tasks would lead to incorrect behavior: since the acks were not placed in outstandingTasks, the read level wouldn't be updated, and we'd re-read the expired tasks in a loop, i.e. the reader would be stuck. (Unless a task was written within that range, that would unstick it.) Even if we read a batch that wasn't all expired, but ended in a long stretch of expired tasks, we wouldn't update the read level as much as we could. This adds expired tasks as acks, and then adds a call to advanceAckLevelLocked, so the read level can be pushed up, and we can advance the ack level over a stretch of expired tasks as well. ## How did you test it? - [ ] built - [ ] run locally and tested manually - [ ] covered by existing tests - [x] added new unit test(s) – and tested it fails without the fix - [ ] added new functional test(s)
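The ack-level mechanics can be sketched with a toy ack manager: expired tasks are recorded as already-acked, so a contiguous acked prefix lets the ack level advance instead of the reader re-reading the same range forever (hypothetical `advanceAckLevel`, not matching's actual implementation):

```go
package main

import (
	"fmt"
	"sort"
)

// advanceAckLevel pushes the ack level past a contiguous prefix of acked
// task IDs, removing them from the outstanding set. Because expired
// tasks are treated as already-acked rather than dropped, a stretch of
// expired tasks no longer pins the ack level in place.
func advanceAckLevel(ackLevel int64, outstanding map[int64]bool) int64 {
	ids := make([]int64, 0, len(outstanding))
	for id := range outstanding {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })
	for _, id := range ids {
		if !outstanding[id] {
			break // first unacked task blocks further advancement
		}
		ackLevel = id
		delete(outstanding, id)
	}
	return ackLevel
}

func main() {
	// Tasks 5-7 expired (recorded as acked); task 8 is still outstanding.
	fmt.Println(advanceAckLevel(4, map[int64]bool{5: true, 6: true, 7: true, 8: false}))
}
```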
…on (temporalio#9774) ## What changed? Reorder code so that the non-blocking `ReadComponent` call uses the original context, and context with the altered deadline is used for the long-poll only ## Why? It looks like it was a mistake; the code state after this commit appears to be what was originally intended. ## How did you test it? - [x] covered by existing tests ## Potential risks Could break describe for standalone activity if this change was misconceived.
Summary
Migrates `tests/user_metadata_test.go` from the testify/suite-based `FunctionalTestBase` pattern to `testcore.NewEnv()` subtests, replacing `testvars.New()` with `s.Tv()` and `testcore.NewContext()` with `s.Context()` in subtests.

Test plan
- `make lint` passes
- `go test -tags functional -run TestUserMetadata ./tests/ -count 1 -v` -- all 3 subtests pass